tech.ml.dataset is a great and fast library which brings columnar datasets to Clojure. Chris Nuernberger has been working on this library for the last year as part of the bigger tech.ml stack.
I started testing the library and helping to fix uncovered bugs. My main goal was to compare its functionality with standards from other platforms. I focused on R solutions: dplyr, tidyr and data.table.
While converting the examples I came up with a way to reorganize existing tech.ml.dataset functions into a simple-to-use API. The main goals were:
Focus on dataset manipulation functionality, leaving out other parts of tech.ml like pipelines, datatypes, readers, ML, etc.
Single entry point for common operations - one function dispatching on given arguments.
group-by results in a special kind of dataset - a dataset containing subsets created after grouping as a column.
Most operations recognize regular and grouped datasets and process data accordingly.
One function form, to enable thread-first on a dataset.
If you want to know more about tech.ml.dataset and dtype-next please refer to their documentation:
Dataset is a special type which can be considered a map of columns, implemented around the tech.ml.datatype library. Each column can be considered a named sequence of typed data. Supported types include integers, floats, strings, booleans, date/time, objects, etc.
Dataset creation
Dataset can be created from various types of Clojure structures and files:
single values
sequence of maps
map of sequences or values
sequence of columns (taken from other dataset or created manually)
sequence of pairs: [string column-data] or [keyword column-data]
array of any arrays
file types: raw/gzipped csv/tsv, json, xls(x) taken from local file system or URL
input stream
tc/dataset accepts:
data
options (see the documentation of the tech.ml.dataset/->dataset function for the full list):
:dataset-name - name of the dataset
:num-rows - number of rows to read from file
:header-row? - indication if first row in file is a header
:key-fn - function applied to column names (eg. keyword, to convert column names to keywords)
:separator - column separator
:single-value-column-name - name of the column when single value is provided
:column-names - in case you want to name columns - only works for sequential input (arrays) or empty dataset
:layout - for numerical, native array of arrays - treat entries :as-rows or :as-columns (default)
tc/let-dataset accepts bindings symbol-column-data to simulate R's tibble function. Each binding is converted into a column. You can refer to previous columns in further bindings (as in let).
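For example, a minimal sketch where the (hypothetical) binding z refers to both earlier bindings:
(tc/let-dataset [x (range 1 6)
                 y 1
                 z (map + x (repeat y))])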
Empty dataset.
(tc/dataset)
Empty dataset with column names
(tc/dataset nil {:column-names [:a :b]})
Sequence of pairs (first = column name, second = value(s)).
(tc/dataset [[:A 33] [:B 5] [:C :a]])
Non-sequential values are repeated row-count number of times.
(tc/dataset [[:A [1 2 3 4 5 6]] [:B "X"] [:C :a]])
Dataset created from map (keys = column names, vals = value(s)). Works the same as sequence of pairs.
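For example, a hedged sketch where the single value "X" is repeated to match the longest column:
(tc/dataset {:A [1 2 3] :B "X" :C [:a :b :c]})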
Get a single value from the table using get-in from the Clojure API, or get-entry. The first argument is the column name, the second is the row number.
(get-in ds ["wind" 2])
2.3
(tc/get-entry ds "wind" 2)
2.3
Printing
Dataset is printed using the dataset->str or print-dataset functions. Options are the same as in tech.ml.dataset/dataset-data->str. The most important one is :print-line-policy, which can be one of: :single, :repl or :markdown.
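For example, a sketch printing the ds dataset used above with the markdown line policy:
(tc/print-dataset ds {:print-line-policy :markdown})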
Grouping by is an operation which splits a dataset into subdatasets and packs them into a new, special type of… dataset. I distinguish two types of dataset: a regular dataset and a grouped dataset. The latter is the result of grouping.
A grouped dataset is annotated with the :grouped? meta tag and consists of the following columns:
:name - group name or structure
:group-id - integer assigned to the group
:data - groups as datasets
Almost all functions recognize the type of the dataset (grouped or not) and operate accordingly.
You can’t apply reshaping or join/concat functions on grouped datasets.
Grouping
Grouping is done by calling the group-by function with the following arguments:
ds - dataset
grouping-selector - what to use for grouping
options:
:result-type - what to return:
:as-dataset (default) - return grouped dataset
:as-indexes - return row ids (row numbers from the original dataset)
:as-map - return a map with group names as keys and subdatasets as values
:as-seq - return a sequence of subdatasets
:select-keys - list of the columns passed to a grouping selector function
All subdatasets (groups) have their name set to the group name; additionally, the group-id is kept in the metadata.
Grouping can be done by:
single column name
seq of column names
map of keys (group names) and row indexes
value returned by a function taking a row as a map (limited to :select-keys) - see the example below
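A sketch grouping by a function of the row, limited to two columns:
(tc/group-by DS
             (fn [row] (* (:V1 row) (:V2 row)))
             {:select-keys [:V1 :V2] :result-type :as-map})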
Note: currently a dataset inside a dataset is printed recursively, so it renders poorly from markdown. I will therefore use the :as-seq result type to show just group names and groups.
List of columns in grouped dataset
(-> DS (tc/group-by :V1) (tc/column-names))
(:V1 :V2 :V3 :V4)
List of columns in grouped dataset treated as regular dataset
You can group by the result of a grouping function, which gets a row as a map and should return a group name. When a map is used as a group name, ungrouping restores the original column names.
Ungrouping simply concats all the groups into a dataset. The following options are possible:
:order? - order groups according to the group name ascending order. Default: false
:add-group-as-column - should the group name become a column? If yes, the column is created with the provided name (or :$group-name if the argument is true). Default: nil.
:add-group-id-as-column - should the group id become a column? If yes, the column is created with the provided name (or :$group-id if the argument is true). Default: nil.
:dataset-name - to name resulting dataset. Default: nil (_unnamed)
If the group name is a map, it will be split into separate columns. Make sure that the groups (subdatasets) don't already contain the same columns.
If the group name is a vector, it will be split into separate columns. If you want to name them, set a vector of target column names as the :add-group-as-column argument, as sketched below.
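A hedged sketch: grouping by a function returning a vector, then naming the restored columns during ungrouping (the target column names here are hypothetical):
(-> DS
    (tc/group-by (fn [row] [(:V1 row) (:V3 row)]))
    (tc/ungroup {:add-group-as-column [:v1-group :v3-group]}))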
After ungrouping, the order of the rows is kept within the groups, but the groups themselves are ordered according to the internal storage.
Grouping and ungrouping.
(-> DS (tc/group-by :V3) (tc/ungroup))
Groups sorted by group name and named.
(-> DS (tc/group-by :V3) (tc/ungroup {:order? true :dataset-name "Ordered by V3"}))
Groups sorted descending by group name and named.
(-> DS (tc/group-by :V3) (tc/ungroup {:order? :desc :dataset-name "Ordered by V3 descending"}))
To check if a dataset is grouped or not, just use the grouped? function.
(tc/grouped? DS)
nil
(tc/grouped? (tc/group-by DS :V1))
true
If you want to remove the grouping annotation (to make all functions work as with a regular dataset) you can use the unmark-group or as-regular-dataset (alias) functions.
This can be important when you want to remove some groups (rows) from a grouped dataset, using drop-rows or something like that.
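For example, a sketch removing the annotation and verifying the result is treated as regular:
(-> DS (tc/group-by :V1) (tc/as-regular-dataset) (tc/grouped?))
nil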
If you want to implement your own mapping function on a grouped dataset, you can call process-group-data and pass a function operating on datasets. The result should be a dataset for ungrouping to work.
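For example, a hedged sketch taking the first row of every group and ungrouping back:
(-> DS
    (tc/group-by :V1)
    (tc/process-group-data #(tc/head % 1))
    (tc/ungroup))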
Column is a special tech.ml.dataset structure based on the tech.ml.datatype library. For our purposes we can treat columns as typed and named sequences bound to a particular dataset.
Type of the data is inferred from a sequence during column creation.
Names
To select dataset columns or column names, a columns-selector is used. columns-selector can be one of the following:
:all keyword - selects all columns
column name - for single column
sequence of column names - for collection of columns
regex - to apply pattern on column names or datatype
filter predicate - to filter column names or datatype
type namespaced keyword for specific datatype or group of datatypes
Column name can be anything.
The column-names function returns names according to columns-selector and an optional meta-field. meta-field is one of the following:
:name (default) - to operate on column names
:datatype - to operate on column types
:all - if you want to process all metadata
Datatype groups are:
:type/numerical - any numerical type
:type/float - floating point number (:float32 and :float64)
:type/integer - any integer
:type/datetime - any datetime type
If qualified keyword starts with :!type, complement set is used.
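For example, selecting all non-numerical columns with the complement keyword:
(tc/column-names DS :!type/numerical)
(:V4)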
To select all column names you can use column-names function.
(tc/column-names DS)
(:V1 :V2 :V3 :V4)
or
(tc/column-names DS :all)
(:V1 :V2 :V3 :V4)
In case you want to select a column which has the name :all (or is a sequence or a map), put it into a vector. The code below returns an empty sequence since there is no such column in the dataset.
(tc/column-names DS [:all])
()
Obviously, selecting a single name returns it if the column exists.
(tc/column-names DS :V1)
(:V1)
(tc/column-names DS "no such column")
()
Select sequence of column names.
(tc/column-names DS [:V1 "V2" :V3 :V4 :V5])
(:V1 :V3 :V4)
Select names based on a regex - column names ending with 1 or 4.
(tc/column-names DS #".*[14]")
(:V1 :V4)
Select names based on a regex operating on the type of the column (to check the column types, call (tc/info DS :columns)). Here we want to get integer columns only.
(tc/column-names DS #"^:int.*" :datatype)
(:V1 :V2)
or
(tc/column-names DS :type/integer)
(:V1 :V2)
And finally, we can use a predicate to select names. Let's select double-precision columns.
(tc/column-names DS #{:float64} :datatype)
(:V3)
or
(tc/column-names DS :type/float64)
(:V3)
If you want to select all columns except the given ones, use the complement function. It works only on a predicate.
You can select column names based on all column metadata at once by using the :all metadata selector. Below we want to select column names ending with 1 which have the long (:int64) datatype.
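A hedged sketch of such a selector (the predicate receives the full metadata map):
(tc/column-names DS (fn [meta]
                      (and (= :int64 (:datatype meta))
                           (clojure.string/ends-with? (str (:name meta)) "1")))
                 :all)
(:V1)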
To add (or replace an existing) column, call the add-column function. The function accepts:
ds - a dataset
column-name - if it’s existing column name, column will be replaced
column - can be a column (from another dataset), a sequence, a single value or a function. Columns that are too long are always trimmed; ones that are too short are cycled or extended with missing values (according to the size-strategy argument)
size-strategy (optional) - when the new column is shorter than the dataset row count, the following strategies are applied:
:cycle - repeat data
:na - append missing values
:strict - (default) throws an exception when sizes mismatch
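For example, a sketch cycling a two-element sequence across all rows:
(tc/add-column DS :V5 [:r :b] :cycle)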
The other way of creating or updating a column is to map rows, as with the regular map function. The arity of the mapping function should be the same as the number of selected columns.
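For example, a hedged sketch using map-columns to build a new column (the :sum name is arbitrary):
(tc/map-columns DS :sum [:V1 :V2] (fn [v1 v2] (+ v1 v2)))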
Converting a column to a given datatype can be done using the convert-types function. Not all types can be converted automatically, and some conversions require slow parsing (every conversion from string). In cases where automatic conversion is not possible, you can pass a conversion function.
Arguments:
ds - dataset
Two call styles:
coltype-map - in case you want to convert several columns; keys are column names, vals are new types
columns-selector and new-types - column name(s) and a new datatype (or datatypes as a sequence)
new-types can be:
a type like :int64 or :string, or a sequence of types
a sequence of pairs of datatype and conversion function
After conversion, additional information is given on problematic values.
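For example, a minimal sketch of both call styles:
(tc/convert-types DS :V1 :float64)
(tc/convert-types DS {:V1 :float64 :V2 :int16})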
The other conversion is casting a column into a java array (->array) of the column's type, or of a type provided as an argument. A grouped dataset returns a sequence of arrays.
You can also cast the type to the other one (if casting is possible):
(tc/->array DS :V4 :string)
["A", "B", "C", "A", "B", "C", "A", "B", "C"]
(tc/->array DS :V1 :float32)
[1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
Rows
Rows can be selected or dropped using various selectors:
row id(s) - row index as a number or a sequence of numbers (the first row has index 0, the second 1, and so on)
sequence of true/false values
filter by predicate (argument is row as a map)
When a predicate is used, you may want to limit the columns passed to the function (the :select-keys option).
Additionally, you may want to precalculate some values which will be visible to the predicate as additional columns. This is done internally by calling add-columns on the dataset; :pre is used as the column definitions.
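A hedged sketch (assuming tech.v3.datatype.functional is required as dfn): precalculate a :mean column and use it inside the predicate:
(tc/select-rows DS
                (fn [row] (< (:mean row) (:V2 row)))
                {:pre {:mean #(dfn/mean (% :V2))}})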
Select
Select fifth row
(tc/select-rows DS 4)
Select 3 rows
(tc/select-rows DS [1 4 5])
Select rows using sequence of true/false values
(tc/select-rows DS [true nil nil true])
Select rows using predicate
(tc/select-rows DS (comp #(< % 1) :V3))
The same works on a grouped dataset; let's select the first row from every group.
Aggregating is an operation which produces a single row out of a dataset.
An aggregator is a function, or a sequence or map of functions, which accepts a dataset as an argument and returns a single value, a sequence of values or a map.
When a map is given as an input or returned as a result, keys are treated as column names.
A grouped dataset is ungrouped after aggregation. This can be turned off by setting :ungroup to false. In case you want to pass additional ungrouping parameters, add them to the options.
By default, resulting column names are prefixed with the summary prefix (set it with the :default-column-name-prefix option).
Let's calculate the sum of column :V2.
(tc/aggregate DS #(reduce + (% :V2)))
Let’s give resulting column a name.
(tc/aggregate DS {:sum-of-V2 #(reduce + (% :V2))})
Sequential result is spread into separate columns
(tc/aggregate DS #(take 5 (% :V2)))
You can combine all variants and rename the default prefix.
You can also perform columnar aggregation. aggregate-columns selects columns and applies an aggregating function (or a sequence of functions) to each column separately.
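For example, a sketch summing each of the selected columns separately:
(tc/aggregate-columns DS [:V1 :V2] #(reduce + %))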
Cross tabulation is built from two sets of columns. First, rows and cols are used to construct a grouped dataset, then the aggregation function is applied to each pair. By default it counts the rows of each group.
Options are:
:aggregator - function which aggregates values of the grouped dataset; by default it's row-count
:marginal-rows and :marginal-cols - if true, sums of rows and cols are added as an additional column and row. May be a custom function which accepts the pure row and col as a seq.
:replace-missing? - should missing values be replaced (default: true) with :missing-value (default: 0)
:pivot? - if false, flat aggregation result is returned (default: false)
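A hedged sketch on a small hypothetical dataset (by default counting co-occurrences):
(def ctds (tc/dataset {:a [:foo :foo :bar :bar :foo] :b [:one :two :one :two :one]}))
(tc/crosstab ctds :a :b)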
A custom comparator can also be used in case objects are not comparable by default. Let's define an artificial one: if the Euclidean distance is lower than 2, compare along z, else along x and y. We use the first three columns for that.
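A hedged sketch of such a comparator passed to order-by (the euclid helper is hypothetical):
(defn euclid
  [v1 v2]
  (Math/sqrt (reduce + (map (fn [a b] (let [d (- a b)] (* d d))) v1 v2))))
(tc/order-by DS [:V1 :V2 :V3]
             (fn [[x1 y1 z1 :as v1] [x2 y2 z2 :as v2]]
               (if (< (euclid v1 v2) 2.0)
                 (compare z1 z2)
                 (compare [x1 y1] [x2 y2]))))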
unique-by removes rows which contain the same data. By default it removes duplicates from the whole dataset. You can also pass a list of columns or functions (similar to group-by) to remove duplicates limited to them. The default strategy is to keep the first row. More strategies below.
unique-by works on groups
Remove duplicates from whole dataset
(tc/unique-by DS)
Remove duplicates from each group selected by column.
(tc/unique-by DS :V1)
Pair of columns
(tc/unique-by DS [:V1 :V3])
A function can also be used; for example (a sketch below), deduplicating by modulo 3 of column :V2.
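(tc/unique-by DS (fn [m] (mod (:V2 m) 3)))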
Missing values can be replaced using several strategies. replace-missing accepts:
dataset
column selector, default: :all
strategy, default: :nearest
value (optional)
single value
sequence of values (cycled)
function, applied on column(s) with stripped missings
map with [index,value] pairs
Strategies are:
:value - replace with given value
:up - copy values up
:down - copy values down
:updown - copy values up and then down for missing values at the end
:downup - copy values down and then up for missing values at the beginning
:mid or :nearest - copy values around known values
:midpoint - use average value from previous and next non-missing
:lerp - tries to linearly interpolate values; works for numbers and datetimes, otherwise applies :nearest. For numbers it always results in a float datatype. See the sketch below.
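A hedged example on a small hypothetical dataset with missing values:
(def dsm (tc/dataset {:a [nil nil 1 2 nil nil 3]}))
(tc/replace-missing dsm :a :value 0)
(tc/replace-missing dsm :a :downup)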
To pack the data into a single value, or unpack it, you can use the fold-by and unroll functions.
fold-by groups a dataset and packs the column data of each group separately into a desired data structure (like a vector or a sequence). unroll does the opposite.
unroll unfolds sequences stored in the data, multiplying the other values when necessary. You can unroll more than one column at once (the folded data should have the same size!).
Options:
:indexes? - if true (or a column name), information about the index within the unrolled sequence is added
:datatypes - a map of datatypes which should be applied to the restored columns
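For example, a sketch folding by two columns and unrolling the packed data back:
(tc/fold-by DS [:V3 :V4])
(-> DS (tc/fold-by [:V3 :V4]) (tc/unroll [:V1 :V2] {:indexes? true}))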
When column names contain observation data, such column names can be split and the data restored into separate columns.
(def who (tc/dataset "data/who.csv.gz"))
(->> who (tc/column-names) (take 10) (tc/select-columns who))
(tc/pivot->longer who #(clojure.string/starts-with? % "new") {:target-columns [:diagnosis :gender :age] :splitter #"new_?(.*)_(.)(.*)" :value-column-name :count})
When the data contains multiple observations per row, we can use a splitter and a pattern for target columns to create new columns and put the values there. In the following dataset we have two observations, dob and gender, for two children. We want to put the child information into a column and leave dob and gender as values.
(def family (tc/dataset "data/family.csv"))
family
(tc/pivot->longer family (complement #{"family"}) {:target-columns [nil :child] :splitter "_" :datatypes {"gender" :int16}})
Similar here, we have two observations: x and y in four groups.
pivot->wider converts rows into columns, accepting:
columns-selector - values from the selected columns are converted to new columns
value-columns - where the values come from
When multiple columns are used as the columns selector, names are joined using the :concat-columns-with option. :concat-columns-with can be a string or a function (default: "_"). The function accepts a sequence of names.
When columns-selector creates a non-unique set of values, they are folded using the :fold-fn option (default: vec).
When value-columns is a sequence, multiple observation columns are created by appending the value column names to the new column names. Names are joined using the :concat-value-with option. :concat-value-with can be a string or a function (default: "-"). The function accepts the current column name and the value.
Use station as the name source for columns and seen for values.
(def fish (tc/dataset "data/fish_encounters.csv"))
fish
(tc/pivot->wider fish "station" "seen" {:drop-missing? false})
If selected columns contain multiple values, such values should be folded.
Joins accept a left-side and a right-side dataset plus a columns selector. Options are the same as in the tech.ml.dataset functions.
The column selector can be a map with :left and :right keys to specify column names separately for the left and right datasets.
The differences from the tech.ml.dataset join functions are: the argument order (datasets first) and the possibility of joining on multiple columns.
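For example, a sketch using the ds1 and ds2 datasets defined earlier in the original document:
(tc/left-join ds1 ds2 :b)
(tc/inner-join ds1 ds2 {:left :a :right :e})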
Multiple-column joins create a temporary index column from the column selection. The method for creating the index is based on the :hashing option and defaults to identity. Prior to 7.000-beta-50 the hash function was used, which caused hash collisions in certain cases.
Additionally set operations are defined: intersect and difference.
To concat two datasets rowwise you can choose one of:
concat - concats rows for matching columns; the number of columns should be equal
union - like concat, but returns unique rows
bind - concats rows, adding missing, empty columns where needed
To add two datasets columnwise, use append. The number of rows should be equal.
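A hedged sketch on small hypothetical datasets:
(def dsa (tc/dataset {:x [1 2] :y ["a" "b"]}))
(tc/concat dsa (tc/dataset {:x [3] :y ["c"]})) ;; matching columns
(tc/bind dsa (tc/dataset {:x [3] :z [9]}))     ;; :y and :z filled with missing values
(tc/append dsa (tc/dataset {:z [9 10]}))       ;; columnwise, equal row counts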
When the :hashing option is used, data from the join columns is preprocessed by applying the join-columns function with :result-type set to the value of :hashing. This helps to create custom joining behaviour. The function used for hashing will get a vector of row values from the join columns.
In the following example we join columns on their values modulo 5.
(tc/left-join ds1 ds2 :b {:hashing (fn [[v]] (mod v 5))})
Cross
Cross product from selected columns
(tc/cross-join ds1 ds2 [:a :b])
(tc/cross-join ds1 ds2 {:left [:a :b] :right :e})
Expand
Similar to cross product but works on a single dataset.
(tc/expand ds2 :a :c :d)
Columns can also be bundled (nested) in tuples, which are treated as single entities during the cross product.
(tc/expand ds2 [:a :c] [:e :b])
Complete
Same as expand, with all other columns preserved (filled with missing values if necessary).
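For example, a hedged sketch on the ds2 dataset from above:
(tc/complete ds2 :a :c)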
In the ML world you very often need to test a given model, preparing a collection of train and test datasets. split creates a new dataset with two additional columns:
:$split-name - with :train, :test, :split-2, … values
:$split-id - id of the split group (for k-fold and repeats)
split-type can be one of the following:
:kfold (default) - k-fold strategy, :k defines number of folds (defaults to 5), produces k splits
:bootstrap - :ratio defines ratio of observations put into result (defaults to 1.0), produces 1 split
:holdout - split into two or more parts with given ratio(s) (defaults to 2/3), produces 1 split
:holdouts - splits into two parts for each of an ascending range of ratios; the range of ratios is given by the :steps option
:loo - leave one out, produces the same number of splits as number of observations
:holdout can also accept probabilities or ratios and can split into more than 2 subdatasets
Additionally you can provide:
:seed - for random number generator
:shuffle? - turn on/off shuffle of the rows (default: true)
:repeats - repeat procedure :repeats times
:partition-selector - same as in group-by for stratified splitting to reflect dataset structure in splits.
:split-names - names of the subdatasets if different from the default, i.e. [:train :test :split-2 ...]
:split-col-name - a column where name of split is stored, either :train or :test values (default: :$split-name)
:split-id-col-name - a column where id of the train/test pair is stored (default: :$split-id)
In case of a grouped dataset, each group is processed separately.
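For example, a minimal sketch of the default (k-fold) and holdout splitting:
(tc/split DS)
(tc/split DS :holdout {:seed 11})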
tablecloth.pipeline exports special versions of the API which create functions operating only on a dataset. This makes it possible to chain operations and compose them easily.
There are two ways to create pipelines:
functional, as a composition of functions
declarative, separating task declarations from their concrete parametrization.
Pipeline operations are prepared to work with the metamorph library. That means that the result of the pipeline is wrapped into a map and the dataset is stored under the :metamorph/data key.
Warning: Duplicated metamorph pipeline functions are removed from tablecloth.pipeline namespace.
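A heavily hedged sketch of the functional style, assuming tablecloth.pipeline mirrors the tablecloth.api names and each operation returns a function of the metamorph context:
(require '[tablecloth.pipeline :as tcp])
(def pipe (comp (tcp/drop-missing)                     ;; applied second
                (tcp/select-columns :type/numerical))) ;; applied first
(:metamorph/data (pipe {:metamorph/data DS}))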
Functions
This API doesn't provide any statistical, numerical or date/time functions. Use the namespaces below:
Below you can find a comparison between the functionality of data.table and the Clojure dataset API. I leave it without comments; please refer to the original document explaining the details:
How can we specify just the columns we would like to compute the mean() on?
R
kable(head(flights[carrier == "AA",                        ## Only on trips with carrier "AA"
                   lapply(.SD, mean),                      ## compute the mean
                   by = .(origin, dest, month),            ## for every 'origin,dest,month'
                   .SDcols = c("arr_delay", "dep_delay")])) ## for just those specified in .SDcols
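A possible tablecloth counterpart (a sketch, assuming flights is loaded as a dataset and tech.v3.datatype.functional is aliased as dfn):
(-> flights
    (tc/select-rows #(= (get % "carrier") "AA"))
    (tc/group-by ["origin" "dest" "month"])
    (tc/aggregate-columns ["arr_delay" "dep_delay"] dfn/mean))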